docs: add stateless redis disruptor proposal #331
docs/01-development/design-docs/003-stateless-redis-disruptor.md
683839e to 8079740
As I mentioned in my comments, I would prefer that we treat the stateful proxy as the goal and keep the stateless proxy as a PoC stage toward that goal within the implementation plan, rather than as an alternative implementation.
The rationale is that the project should aim for meaningful developer experiences rather than simple implementations, and given the limitations of the stateless proxy, I'm not sure we would want to offer it to users.
## Background
Caching services like Redis are a common way to improve the performance of distributed systems, but they sometimes make it difficult to know or estimate how a given system will behave when the caching service stops behaving as expected. Non-catastrophic failure modes, such as increased latency or an unexpected rise in miss rate, can affect a distributed system in qualitative ways and lead to catastrophic failure.
Caching services like Redis are a common way to improve the performance of applications, but sometimes it is difficult to know or estimate how a given system will behave when the caching service stops behaving as expected. Non-catastrophic failure modes such as an increase in latency, or unexpected miss rate increase can affect a system in significant ways and lead to catastrophic failure.
## Problem statement
A big reliability challenge, and a common source of incidents, is the [metastable behavior](http://charap.co/metastable-failures-in-distributed-systems/) ([archive.org](https://web.archive.org/web/20230502171335/http://charap.co/metastable-failures-in-distributed-systems/)) of distributed systems when using common patterns such as caching. A common example of a metastable failure is a system that is responding well to a certain load thanks to a warm cache, but when this cache is lost due to an instance restart or a node failure, the sudden cache miss events overload the backing database, preventing the system from recovering.
A big reliability challenge, and a common source of incidents, is the [metastable behavior](http://charap.co/metastable-failures-in-distributed-systems/) ([archive.org](https://web.archive.org/web/20230502171335/http://charap.co/metastable-failures-in-distributed-systems/)) of applications when using caching. In this scenario, the application is responding well to a certain load thanks to a warm cache, but when this cache is lost due to an instance restart, or a node failure, the sudden cache miss events overload the backing database preventing the system from recovering.
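The trigger of the scenario above can be illustrated with simple arithmetic. The following is a minimal sketch, not part of the proposal: the function names, the request rate, and the capacity figures are hypothetical, and it models only the moment the cache is lost, not the feedback loop that keeps the system from recovering.

```python
def backend_load(request_rate: float, hit_ratio: float) -> float:
    """Requests per second that miss the cache and reach the database."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return request_rate * (1.0 - hit_ratio)


def is_overloaded(request_rate: float, hit_ratio: float,
                  backend_capacity: float) -> bool:
    """True when cache misses alone exceed what the database can serve."""
    return backend_load(request_rate, hit_ratio) > backend_capacity


# Hypothetical numbers: 1000 req/s, a database able to serve 300 req/s.
warm = is_overloaded(1000, 0.75, 300)  # warm cache: only 250 req/s reach the database
cold = is_overloaded(1000, 0.0, 300)   # cache lost: all 1000 req/s reach the database
```

With these assumed figures the database is comfortable while the cache is warm and overloaded as soon as the hit ratio drops to zero, which is the condition a fault injection test would try to reproduce.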
## Goals
Add baseline redis faulting functionality to the disruptor, so application and platform engineers can perform tests and understand how their systems respond to non-ideal conditions. This document proposes the addition of two types of faults:
Add Redis fault injection functionality to the disruptor, so application and platform engineers can perform tests and understand how their systems respond to non-ideal conditions. This document proposes the addition of two types of faults:
Without the requirement of being able to correlate responses with the requests that originated them, a RESP proxy can be made stateless. This reduces the complexity at the cost of, as expected, not being able to correlate those responses. However, it should still be possible to meet the goals above with a stateless proxy.
A stateless RESP proxy accepts connections from Redis clients. It will read messages sent by clients, parse them, and decide if any action is necessary, such as modifying the request, or delaying it. It simply passes through responses from the server back to the client, without needing to decode them. A stateless proxy always needs to forward requests, modified or not, to the upstream server. As it is not aware of the flow of responses, it should be compatible with server pushes without needing any additional logic.
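The request-side logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation proposed in the design doc: it assumes clients send commands as RESP arrays of bulk strings, and the function names and the `__miss__:` key prefix used to force a cache miss are hypothetical.

```python
from typing import List, Tuple

CRLF = b"\r\n"


def parse_command(buf: bytes, pos: int = 0) -> Tuple[List[bytes], int]:
    """Parse one RESP array of bulk strings starting at pos.

    Returns the command arguments and the offset just past the command.
    """
    if buf[pos:pos + 1] != b"*":
        raise ValueError("expected RESP array")
    end = buf.index(CRLF, pos)
    count = int(buf[pos + 1:end])
    pos = end + 2
    args = []
    for _ in range(count):
        if buf[pos:pos + 1] != b"$":
            raise ValueError("expected bulk string")
        end = buf.index(CRLF, pos)
        length = int(buf[pos + 1:end])
        start = end + 2
        args.append(buf[start:start + length])
        pos = start + length + 2  # skip the payload and its trailing CRLF
    return args, pos


def encode_command(args: List[bytes]) -> bytes:
    """Serialize arguments back into a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        out.append(b"$%d\r\n%s\r\n" % (len(arg), arg))
    return b"".join(out)


def rewrite_request(buf: bytes, miss_prefix: bytes = b"__miss__:") -> bytes:
    """Rewrite the key of every GET in a message; pass other commands through.

    A single message may carry several pipelined commands, so the buffer
    is walked command by command. The prefix scheme is an assumption.
    """
    out = []
    pos = 0
    while pos < len(buf):
        args, pos = parse_command(buf, pos)
        if len(args) == 2 and args[0].upper() == b"GET":
            args = [args[0], miss_prefix + args[1]]
        out.append(encode_command(args))
    return b"".join(out)
```

Nothing in this sketch looks at server responses: the proxy would forward the output of `rewrite_request` upstream and copy whatever comes back to the client unchanged. A real proxy would also need to buffer partial reads, which is omitted here.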
From what I understand from the description of the fault injection in the following sections, I think this approach is rather limiting:
- It has to change the keys in the upstream requests instead of intercepting and modifying the responses. This may not have any side effects, but I still find it "hacky".
- It does not allow latency per command, only per message (as I understand it, a message can contain multiple commands).
I would like to evaluate the complexity of an alternative approach that is aware of the responses.
### Advantages
- Easier to implement and less error-prone than a stateful proxy
I think that it is valid to implement a PoC using the stateless approach, but I would prefer that we address a full-fledged implementation in this design document.
#### Disadvantages
- Code is more complex, requiring more development time and increasing the surface for bugs to appear.
Even though this is a valid concern, I think we should explore this option and leave the stateless proxy as a PoC of the final goal.
Regarding the complexity of implementing the Redis protocol, we can explore and learn from existing projects:
Description
This PR adds a proposal to add a redis disruptor. Details about the use cases and implementation proposals can be found in the added design doc.